Abstract: This paper introduces a set of algorithms for 
Monte-Carlo Bayesian reinforcement learning. Firstly, Monte- 
Carlo estimation of upper bounds on the Bayes-optimal value 
function is employed to construct an optimistic policy. Secondly, 
gradient-based algorithms for approximate upper and lower 
bounds are introduced. Finally, we introduce a new class of 
gradient algorithms for Bayesian Bellman error minimisation. 
We theoretically show that the gradient methods are sound. 
Experimentally, we demonstrate the superiority of the upper 
bound method in terms of reward obtained. However, we also 
show that the Bayesian Bellman error method is a close second, 
despite its significant computational simplicity. 

I. Introduction 

Bayesian reinforcement learning [1], [2] is the decision- 
theoretic approach [3] to solving the reinforcement learning 
problem. Unfortunately, calculating posterior distributions 
can be computationally expensive. Moreover, the Bayes-optimal 
decision can be intractable [4], [5], [1], and even 
calculating an optimal solution in a restricted class can be 
difficult [6]. This paper proposes a set of algorithms that 
take actions by estimating bounds on the Bayes-optimal 
utility through sampling. They include a direct Monte- 
Carlo approach, as well as gradient-based approaches. We 
demonstrate the effectiveness of the proposed algorithms 
experimentally. 

A. Setting 

In the reinforcement learning problem, an agent is acting 
in some unknown Markovian environment $\mu \in \mathcal{M}$, according 
to some policy $\pi \in \Pi$. The agent's policy is a procedure for 
selecting actions, with the action at time $t$ being $a_t \in \mathcal{A}$. 
The environment reacts to this action with a sequence of 
states $s_t \in \mathcal{S}$ and rewards $r_t \in \mathbb{R}$. Since the agent may be 
learning from experience, this interaction may depend on the 
complete history, $h_t \in \mathcal{H}$, where $\mathcal{H} \triangleq (\mathcal{S} \times \mathcal{A} \times \mathbb{R})^*$ is the 
set of all state-action-reward sequences. 

The complete Markov decision process (MDP) is specified 
as follows. The agent's action at time $t$ depends on the history 
observed so far: 

$$a_t \mid h_t = (s^t, r^t, a^{t-1}) \sim P^\pi(a_t \mid s^t, r^t, a^{t-1}), \tag{1}$$

where $s^t$ is a shorthand for the sequence $(s_i)_{i=1}^{t}$; similarly, 
we use $s_t^{t+k}$ for $(s_i)_{i=t}^{t+k}$. We denote the environment's response 
at time $t + 1$ given the history at time $t$ by: 

$$s_{t+1}, r_{t+1} \mid h_t = (s^t, r^t, a^t) \sim P_\mu(s_{t+1}, r_{t+1} \mid s_t, a_t). \tag{2}$$

Finally, the agent's goal is determined through its utility: 

$$U \triangleq \sum_{t=0}^{\infty} \gamma^t r_t, \tag{3}$$

which is a discounted sum of the total instantaneous rewards 
obtained, with $\gamma \in [0, 1]$. Without loss of generality, we 
assume that $U \in [0, U_{\max}]$. The optimal agent policy 
maximises $U$ in expectation, i.e. 

$$\max_{\pi \in \Pi} \mathbb{E}^\pi_\mu U, \tag{4}$$

where $P^\pi_\mu$, $\mathbb{E}^\pi_\mu$ denote probabilities and expectations under 
the process jointly specified by $\mu, \pi$. However, as in the 
reinforcement learning problem the environment $\mu$ is unknown, 
the above maximisation is ill-posed. Intuitively, the 
agent can increase its expected utility by either: (i) trying to 
better estimate $\mu$ in order to perform the maximisation later 
(exploration), or (ii) using a best-guess estimate of $\mu$ to obtain 
high rewards (exploitation). 

In order to solve this trade-off, we can adopt a Bayesian 
viewpoint [3], [7], where we consider a (potentially infinite) 
set of environment models $\mathcal{M}$. In particular, we select a prior 
probability measure $\xi$ on $\mathcal{M}$. For an appropriate subset $B \subset \mathcal{M}$, 
the quantity $\xi(B)$ describes our initial belief that the 
correct model lies in $B$. We can now formulate the alternative 
goal of maximising the expected utility with respect to $\xi$: 

$$\mathbb{E}^*_\xi U \triangleq \max_\pi \mathbb{E}^\pi_\xi U = \max_\pi \int_{\mathcal{M}} (\mathbb{E}^\pi_\mu U) \, d\xi(\mu). \tag{5}$$

This makes the problem formally sound. A policy $\pi^*_\xi \in \arg\max_\pi \mathbb{E}^\pi_\xi U$ 
is called Bayes-optimal as it solves the 
exploration-exploitation problem with respect to our prior 
belief $\xi$. However, its computation is generally hard [8] 
even in restricted classes of policies [6]. On the other 
hand, simple heuristics such as Thompson sampling [9], [1] 
provide an efficient trade-off [10], [11] between exploration 
and exploitation. 

B. Related work and our contribution 

One difficulty that arises when adopting a Bayesian 
approach to sequential decision making is that in many 
interesting problems, the posterior calculation itself requires 
approximations, mainly due to partial observability [12], [4]. 
The second and more universal problem, which we consider 
in this paper, is that finding the Bayes-optimal policy is hard, 
as the set of policies we must consider grows exponentially 
with the horizon T. However, heuristics exist which, given 
the current posterior, can obtain a near-optimal policy [13], 
[14], [1], [15], [6], [16]. In this paper we shall focus on 
model-based algorithms that use approximate lower and 
upper bounds on the Bayes-optimal utility to select actions. 

The general idea of computing lower and upper bounds 
via Monte-Carlo sampling in model-based Bayesian reinforcement 
learning was introduced in [5]. That work sampled MDP 
models from the current belief to estimate stochastic upper 
and lower bounds. These bounds were then used to perform 
a stochastic branch and bound search for an optimal policy. 
In a follow-up paper [6], an attempt was made to obtain 
tighter lower bounds by finding a good memoryless policy. 
An earlier class of approaches involving lower bounds is the 
work of [16], which sampled beliefs rather than MDPs to 
construct lower bound approximations. 

In order to perform the approximations, we also introduce 
a number of gradient-based algorithms. Relevant work in this 
domain includes the Gaussian process (GP) based algorithms 
suggested by [17], [18] and [19]. In particular, [17] performs 
an incremental temporal-difference fit of the value function 
using GPs, implicitly using the empirical model of the pro- 
cess. The other two approaches are model-based, with [18] 
estimating a gradient direction for policy improvement by 
drawing sample trajectories from the marginal distribution. 
An analytic solution to the problem of policy improvement 
with GPs is given by [19], which however relies on the 
expected transition kernel of the process and so does not 
appear to take the model uncertainty into account. 

The approaches suggested in this paper are considerably 
simpler, as well as more general, in that they are appli- 
cable to any Bayesian model of the Markov process and 
parametrisation of the value function. The fundamental idea 
stems from the observation that, in order to estimate the 
Bayes-utility of a policy, we can draw sample MDPs from 
the posterior, calculate the (either current policy's, or the 
optimal) utility for each MDP and average. The same effect 
can be achieved in an iterative procedure, by drawing only 
one MDP, estimating the utility of our policy, and then 
adjusting our parameters to approach the sampled utility. 
This can be achieved with gradient methods. Finally, we use 
the same sampling idea to minimise the Bellman error of the 
Bayes-expected value function, in a fully incremental fashion 
that explicitly takes into account the model uncertainty. 

II. Gradient Bayesian reinforcement learning 

Imagine that the history $h_t \in \mathcal{H}$ of length $t$ has been 
generated from $P^\pi_\mu$, the process defined by an MDP $\mu \in \mathcal{M}$ 
controlled with a history-dependent policy $\pi$. Now consider a 
prior belief $\xi_0$ on $\mathcal{M}$ with the property that $\xi_0(\cdot \mid \pi) = \xi_0(\cdot)$, 
i.e. that the prior is independent of the policy used. Then the 
posterior probability, given a history $h_t$ generated by a policy 
$\pi$, that $\mu \in B$ can be written as: 

$$\xi_t(B \mid \pi) \triangleq \xi_0(B \mid h_t, \pi) = \frac{\int_B P^\pi_\mu(h_t) \, d\xi_0(\mu)}{\int_{\mathcal{M}} P^\pi_\mu(h_t) \, d\xi_0(\mu)}. \tag{6}$$

Fortunately, the dependence on the policy can be removed, 
since the posterior is the same for all policies that put non-zero 
mass on the observed data. Thus, in the sequel we shall 
simply write $\xi_t$ for the posterior probability over MDPs at 
time $t$. 

A. Value functions 

Value functions are an important concept in reinforcement 
learning. Briefly, a value function $V^\pi_\mu : \mathcal{S} \to \mathbb{R}$ gives the 
expected utility for the policy $\pi$ acting in an MDP $\mu$, given 
that we start at state $s$, i.e. 

$$V^\pi_\mu(s) \triangleq \mathbb{E}^\pi_\mu(U \mid s_t = s). \tag{7}$$

A similar notion is expressed by the Q-value function $Q^\pi_\mu : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, 
which is the expected utility for the policy $\pi$ 
acting in an MDP $\mu$, given that we start at state $s$ and take 
action $a$, i.e. 

$$Q^\pi_\mu(s, a) \triangleq \mathbb{E}^\pi_\mu(U \mid s_t = s, a_t = a). \tag{8}$$

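Similarly, and with a slight abuse of notation, we define 
the Bayesian value function $V^\pi_\xi : \mathcal{S} \to \mathbb{R}$, and the related 
Bayesian Q-value function $Q^\pi_\xi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$. These are 
defined for any belief $\xi$ and policy $\pi$ to be the corresponding 
expected value functions over all MDPs: 

$$V^\pi_\xi(s) \triangleq \int_{\mathcal{M}} V^\pi_\mu(s) \, d\xi(\mu), \tag{9}$$
$$Q^\pi_\xi(s, a) \triangleq \int_{\mathcal{M}} Q^\pi_\mu(s, a) \, d\xi(\mu). \tag{10}$$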

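Due to the convexity of the Bayes-optimal expected utility [3] 
with respect to the belief $\xi$, it can be bounded from 
above and below also for the Bayesian RL problem [5]: 

$$\int_{\mathcal{M}} \max_\pi (\mathbb{E}^\pi_\mu U) \, d\xi(\mu) \geq \mathbb{E}^*_\xi(U) \geq \mathbb{E}^{\pi'}_\xi(U), \quad \forall \pi' \in \Pi. \tag{11}$$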

Since it is hard to find the Bayes-optimal policy [8], [20], [3], 
[5], we may instead try and estimate upper and lower bounds 
on the expected utility, and consequently, on the Q-value 
function. These can then be used to implement a heuristic 
policy that is either exploratory (when we use upper bounds) 
or conservative (when we use lower bounds). 

To achieve this, we propose a number of simple algo- 
rithms. First, we describe the direct upper bound estimation 
proposed in [5] in the context of tree search. Here, we apply 
it to select a policy directly, in a manner similar to the 
lower bound approach in [6]. We then describe gradient- 
based incremental versions of both algorithms. However, all 
of these algorithms require estimating the value function 
of a sampled MDP, a potentially expensive process. For 
this reason, we also derive a gradient-based algorithm for 
minimising the Bellman error of the Bayes-value function. This is 
shown to perform almost as well as the previous algorithms, 
with significantly less computational effort. 

B. Direct upper bound estimation 

The idea of the following algorithm stems directly from 
the definition of the upper bound (11). In fact, [5] had 
previously used such upper bounds in order to guide tree 
search, while [6] had used lower bounds directly for taking 
actions. However, to our knowledge the simple idea of 
estimating the upper bound (11) and using it to directly take 
actions has never been tried before in practice. 



We can estimate an upper bound value vector¹ $q$ by direct 
Monte Carlo sampling from our belief $\xi$: 

$$q_{s,a} = \frac{1}{N} \sum_{i=1}^{N} Q^*_{\mu_i}(s, a), \tag{12}$$

where $\mu_1, \ldots, \mu_N \sim \xi$.² This idea is significantly simpler than that of constructing 
credible intervals (see for example [22]). In addition, estimation 
of $Q^*_{\mu_i}$ for each sampled MDP is easy. This is in contrast 
with the lower bound approach advocated in [6]. 

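Algorithm 1 U-MCBRL: Upper-bound Monte-Carlo Bayesian RL 

  Input: prior $\xi_0$, value vector $q$, initial state $s_0$, number of samples $N$. 
  for $t = 0, \ldots$ do 
    if switch-policy then 
      $\mu_1, \ldots, \mu_N \sim \xi_t$                                   // Sample N MDPs 
      $q_{s,a} = \frac{1}{N} \sum_{i=1}^{N} Q^*_{\mu_i}(s, a)$            // Get Q upper bound 
    end if 
    $a_t = \arg\max_{a \in \mathcal{A}} q_{s_t,a}$                        // Act in the real MDP 
    $s_{t+1}, r_{t+1} \sim \mu$                                           // Observe next state, reward 
    $\xi_{t+1}(\cdot) = \xi_t(\cdot \mid s_{t+1}, r_{t+1}, s_t, a_t)$     // Update posterior 
  end for 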

The algorithm is presented in Alg. 1. The hyperparameters of 
the algorithm are the number of samples $N$ to take at each 
iteration and the points at which to switch policy³. 
This paper uses the simple strategy of linearly incrementing 
the switching interval. Let us now see how we can directly 
approximate both lower bounds, such as those in [6], and 
upper bounds, such as that in Alg. 1, via gradient methods. 
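
As a minimal sketch of the estimate (12), the following Python fragment averages the optimal Q-values of MDPs sampled from a Dirichlet-product posterior. It is illustrative only: the function names, the toy pseudo-counts, the assumption that mean rewards `R` are known, and the use of a fixed number of value iteration sweeps are all assumptions made here, not details taken from the paper.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one probability vector from Dirichlet(alpha) via normalised Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_mdp(counts, rng):
    """Sample transition probabilities P[s][a][s'] from a Dirichlet-product posterior."""
    return [[sample_dirichlet(counts[s][a], rng) for a in range(len(counts[s]))]
            for s in range(len(counts))]

def optimal_q(P, R, gamma, n_iter=200):
    """Value iteration for Q* on one sampled MDP, with mean rewards R[s][a]."""
    n_s, n_a = len(P), len(P[0])
    Q = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(n_iter):
        V = [max(row) for row in Q]
        Q = [[R[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))
              for a in range(n_a)] for s in range(n_s)]
    return Q

def upper_bound_q(counts, R, gamma, n_samples, rng):
    """Average Q* over n_samples posterior draws, as in (12)."""
    n_s, n_a = len(counts), len(counts[0])
    q = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(n_samples):
        Q = optimal_q(sample_mdp(counts, rng), R, gamma)
        for s in range(n_s):
            for a in range(n_a):
                q[s][a] += Q[s][a] / n_samples
    return q

rng = random.Random(0)
# Two states, two actions; hypothetical Dirichlet pseudo-counts (prior plus observations).
counts = [[[1.0, 1.0], [1.0, 3.0]], [[2.0, 1.0], [1.0, 1.0]]]
R = [[0.0, 0.1], [0.0, 1.0]]  # assumed known mean rewards, for illustration only
q = upper_bound_q(counts, R, gamma=0.9, n_samples=20, rng=rng)
greedy = [max(range(2), key=lambda a: q[s][a]) for s in range(2)]
```

The greedy action with respect to `q` plays the role of $a_t$ in Alg. 1; in a full agent the pseudo-counts would be updated after every observed transition and `q` recomputed only at policy-switching times.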

C. Direct gradient approximation 

We now present a simple class of algorithms for gradient 
Bayesian reinforcement learning. First, let us consider the 
estimation for a specific policy $\pi$, which will correspond to 
approximating a lower bound. Define a family of functions 
$\{v_\theta : \mathcal{S} \to \mathbb{R} \mid \theta \in \Theta\}$. We consider the problem of 
estimating the expected value function given some belief $\xi$: 

$$f(\theta) \triangleq \int_{\mathcal{S}} g(\theta; s) \, d\chi(s), \tag{13}$$
$$g(\theta; s) \triangleq \|v_\theta(s) - V^\pi_\xi(s)\|^2, \tag{14}$$

where $\chi$ is a measure on $\mathcal{S}$, and $\|\cdot\|$ is the Euclidean norm. 
Then the derivative of (14) can be written as: 

$$\nabla_\theta g(\theta; s) = 2 (v_\theta(s) - V^\pi_\xi(s)) \nabla_\theta v_\theta(s). \tag{15}$$



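Let $\omega_k(s) = V^\pi_{\mu_k}(s)$ be the value function of an MDP 
sampled from the belief, i.e. $\mu_k \sim \xi$. Then, due to the 
linearity of expectations, it is easy to see that: 

$$\nabla_\theta g(\theta; s) = \mathbb{E}_\xi [2 (v_\theta(s) - \omega_k(s)) \nabla_\theta v_\theta(s)]. \tag{16}$$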



¹ For continuous spaces, this can be defined on a set of representative 
states. 

² Due to the Hoeffding bound [21] and the boundedness of the value 
function, it is easy to see that this estimate is $O(1/\sqrt{N})$-close to the upper 
bound (11) with high probability. 

³ Since re-sampling and calculating new value functions is expensive. 



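Consequently, $\omega_k$ can be used to obtain the following 
stochastic approximation [23], [24] algorithm 

$$\theta_{k+1} = \theta_k - \eta_k (v_{\theta_k}(s) - \omega_k(s)) \nabla_\theta v_{\theta_k}(s), \tag{17}$$

where $\eta_k$ must be a step-size parameter satisfying $\sum_k \eta_k = \infty$, 
$\sum_k \eta_k^2 < \infty$. A similar approach can be used to estimate 
the Q-value function with an approximation $q_\theta : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$: 

$$\theta_{k+1} = \theta_k - \eta_k (q_{\theta_k}(s, a) - \omega_k(s, a)) \nabla_\theta q_{\theta_k}(s, a), \tag{18}$$

where $\omega_k(s, a) = Q^\pi_{\mu_k}(s, a)$. This update can also be 
performed over the complete state-action space: 

$$\theta_{k+1} = \theta_k - \eta_k \sum_{s,a} D_k(s, a), \tag{19}$$
$$D_k(s, a) = (q_{\theta_k}(s, a) - \omega_k(s, a)) \nabla_\theta q_{\theta_k}(s, a). \tag{20}$$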

The same procedure can be applied to approximate the 
upper bound (11). This only requires a trivial modification 
to the above algorithms, by setting $\omega_k(s) = V^*_{\mu_k}(s)$ or 
$\omega_k(s, a) = Q^*_{\mu_k}(s, a)$ in either case. It is easy to see that 
the above approximation still holds. 
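
To illustrate the update (19)-(20) in the tabular case, the sketch below moves a table of parameters towards sampled Q-tables with step size $\eta_k = 1/(k+1)$, which makes the iterate equal to the running average of the samples. The sampler `sample_target` is a hypothetical stand-in for "draw an MDP from the belief and compute its Q-values"; here it simply returns noisy copies of a fixed table so that the fragment is self-contained.

```python
import random

def direct_gradient_fit(sample_target, n_states, n_actions, n_steps):
    """Tabular version of update (19)-(20): move theta towards sampled Q-tables."""
    theta = [[0.0] * n_actions for _ in range(n_states)]
    for k in range(n_steps):
        omega = sample_target()      # Q-table of one MDP drawn from the belief
        eta = 1.0 / (k + 1)          # sum of eta diverges, sum of eta^2 is finite
        for s in range(n_states):
            for a in range(n_actions):
                # For a table, the gradient of q_theta(s, a) w.r.t. theta[s][a] is 1.
                theta[s][a] -= eta * (theta[s][a] - omega[s][a])
    return theta

# Hypothetical stand-in for sampling and solving an MDP: noisy copies of a fixed table.
rng = random.Random(0)
base = [[1.0, 2.0], [0.5, 3.0]]
history = []

def sample_target():
    omega = [[v + rng.uniform(-0.5, 0.5) for v in row] for row in base]
    history.append(omega)
    return omega

theta = direct_gradient_fit(sample_target, n_states=2, n_actions=2, n_steps=500)
mean_00 = sum(h[0][0] for h in history) / len(history)
```

With this step-size schedule `theta[0][0]` coincides with `mean_00` up to rounding, which is the sense in which the iterative procedure reproduces the Monte-Carlo average.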

Algorithm 2 DGBRL: Direct gradient Bayesian RL. 

  Input: prior $\xi_0$, parameters $\theta_0$, initial state $s_0$ 
  for $t = 0, \ldots$ do 
    $\mu_t \sim \xi_t$                                                    // Sample an MDP 
    $\omega_t = Q^\pi_{\mu_t}$ (or $Q^*_{\mu_t}$)                         // Get value of sample 
    $\theta_{t+1} = \theta_t - \eta_t \sum_{s,a} D_t(s, a)$               // Update parameters 
    $a_t = \arg\max_{a \in \mathcal{A}} q_{\theta_t}(s_t, a)$             // Act in the real MDP 
    $s_{t+1}, r_{t+1} \sim \mu$                                           // Observe next state, reward 
    $\xi_{t+1}(\cdot) = \xi_t(\cdot \mid s_{t+1}, r_{t+1}, s_t, a_t)$     // Update posterior 
  end for 

To make the approximation faster, we can take a single 
MDP sample at every step, take an action, and then use 
the previous approximation for the next step. If the belief 
^t changes sufficiently slowly then this will be almost as 
good as taking multiple samples and finding the best approximation 
at every step. The complete algorithm is shown in 
Alg. 2. The advantage of this idea over the upper and lower 
bound approach advocated in [5], [6], is that we can reuse 
information from previous steps without needing to take 
multiple MDP samples. 

In either case, the computational difficulty is the calculation 
of $V^\pi_\mu$, which we still need to do once at every step. 
The next section discusses another idea where the complete 
estimation of a value function for each sampled MDP is not 
required. 

D. Temporal difference-like error minimisation 

One alternative idea is to simply estimate a consistent 
value function approximation, similar to those used in 
temporal-difference (TD) methods (in particular the gradient-based 
view of TD-like methods in [24]). The general idea is 
to form the following minimisation problem: 

$$\min_\theta f(\theta), \qquad f(\theta) \triangleq \int_{\mathcal{S}} g(\theta; s) \, d\chi(s), \tag{21}$$
$$g(\theta; s) \triangleq \int_{\mathcal{M}} \|h(\theta; \mu, s)\|^2 \, d\xi(\mu), \tag{22}$$
$$h(\theta; \mu, s) \triangleq v_\theta(s) - \rho(s) - \gamma \int_{\mathcal{S}} v_\theta(s') \, dP^\pi_\mu(s' \mid s). \tag{23}$$

Now let us sample a state $s_k \sim \chi$ from the state distribution, 
an MDP $\mu_k \sim \xi$ from the belief and a next state $s'_k \sim 
P^\pi_{\mu_k}(s' \mid s_k)$ from the transition kernel of the sampled MDP 
given the sampled current state. Using the Euclidean norm 
for $\|\cdot\|$ and taking the gradient with respect to $\theta$ we obtain: 

$$D_k = 2 h(\theta_k; \mu_k, s_k) (\nabla_\theta v_{\theta_k}(s_k) - \gamma \nabla_\theta v_{\theta_k}(s'_k)), \tag{24}$$
$$\theta_{k+1} = \theta_k - \eta_k D_k. \tag{25}$$

By choosing an appropriate approximation architecture, e.g. 
a linear approximation with bounded bases, the following 
corollary holds: 

Corollary 1: If $\|\nabla_\theta v_\theta\| < c$ and $\|\nabla^2_\theta v_\theta\| < c'$ with $c, c' < \infty$, 
then $f(\theta_k)$ converges, with $\lim_{k \to \infty} \nabla_\theta f(\theta_k) = 0$. 

Proof: This result follows from Proposition 4.1 
in [24], since the sequence satisfies the four conditions 
in Assumption 4.2. (a) $f \geq 0$. (b) $f$ is twice 
differentiable and its second derivative is bounded, as 
$\|\int_{\mathcal{S}} \nabla^2_\theta v_\theta(s') \, dP^\pi_\mu(s' \mid s)\| \leq \int_{\mathcal{S}} \|\nabla^2_\theta v_\theta(s')\| \, dP^\pi_\mu(s' \mid s) \leq c'$. 
(c) By taking expectations over the sample, it is easy to 
see that $\mathbb{E} D_k = \nabla_\theta f(\theta_k)$. (d) follows from the boundedness 
of the first derivative. ■ 
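
The update (24)-(25) can be sketched for a linear approximation $v_\theta(s) = \theta^\top \phi(s)$, whose gradient is simply the feature vector. The function name, the feature vectors and the one-state toy problem below are assumptions made for illustration; in the actual algorithm the next state would be drawn from an MDP sampled from the belief rather than fixed.

```python
def td_gradient_step(theta, phi_s, phi_next, reward, gamma, eta):
    """One update (24)-(25) for a linear approximation v_theta(s) = theta . phi(s)."""
    v_s = sum(t * f for t, f in zip(theta, phi_s))
    v_next = sum(t * f for t, f in zip(theta, phi_next))
    h = v_s - reward - gamma * v_next          # sampled version of (23)
    # The gradient of v_theta is the feature vector, so
    # D_k = 2 h (phi(s) - gamma phi(s')).
    return [t - eta * 2.0 * h * (fs - gamma * fn)
            for t, fs, fn in zip(theta, phi_s, phi_next)]

# Toy check: a single state that loops to itself with reward 1,
# so the consistent value is 1 / (1 - gamma) = 2 for gamma = 0.5.
theta = [0.0]
for _ in range(200):
    theta = td_gradient_step(theta, phi_s=[1.0], phi_next=[1.0],
                             reward=1.0, gamma=0.5, eta=0.5)
```

In this deterministic toy case the iteration reduces to $\theta \leftarrow 0.75\,\theta + 0.5$, whose fixed point is the value 2 at which the residual $h$ vanishes.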

E. Bellman error minimisation 

An alternative formulation is Bellman error minimisation 
([24], Sec. 6.10), where instead of minimising the error with 
respect to the current policy, we minimise the error over 
the Bellman operator applied to the current value function. 
This is simplest to do when we are working with Q-value 
functions. Then the problem can be written as: 

$$\min_\theta f(\theta), \qquad f(\theta) \triangleq \sum_{a \in \mathcal{A}} \int_{\mathcal{S}} g(\theta; s, a) \, d\chi(s), \tag{26}$$
$$g(\theta; s, a) \triangleq \int_{\mathcal{M}} \|h(\theta; \mu, s, a)\|^2 \, d\xi(\mu), \tag{27}$$
$$h(\theta; \mu, s, a) \triangleq q_\theta(s, a) - \rho(s) - \gamma \int_{\mathcal{S}} q_\theta(s', a^*(s')) \, dP_\mu(s' \mid s, a), \tag{28}$$
$$a^*(s') \triangleq \arg\max_{a' \in \mathcal{A}} q_\theta(s', a'). \tag{29}$$

Using the same reasoning as in Sec. II-D, we sample $s_k \sim \chi$, 
$\mu_k \sim \xi$ from the belief and a next state $s'_k \sim P_{\mu_k}(s' \mid s_k, a_k)$ 
from the transition kernel of the sampled MDP given the 
sampled current state. Using the Euclidean norm for $\|\cdot\|$ and 
taking the gradient with respect to $\theta$ we obtain the algorithm: 

$$D_k = 2 h(\theta_k; \mu_k, s_k, a_k) [\nabla_\theta q_{\theta_k}(s_k, a_k) - \gamma \nabla_\theta q_{\theta_k}(s'_k, a^*(s'_k))], \tag{30}$$
$$\theta_{k+1} = \theta_k - \eta_k D_k. \tag{31}$$



It is easy to see that Corollary 1 is also applicable to this 
update sequence. When the state sequence is generated from 
a particular policy, rather than being drawn from some 
distribution, we obtain Alg. 3. 
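
For a tabular parametrisation, where $\nabla_\theta q_\theta(s, a)$ is an indicator vector, one step of (30)-(31) touches at most two table entries. The sketch below is a hedged illustration under that assumption; the function name, the explicit `reward` argument and the one-state, two-action toy problem are not from the paper, and the next state is fixed rather than drawn from a sampled MDP.

```python
def bellman_gradient_step(theta, s, a, reward, s_next, gamma, eta):
    """One tabular update (30)-(31); theta[s][a] plays the role of q_theta(s, a)."""
    actions = range(len(theta[s_next]))
    a_star = max(actions, key=lambda b: theta[s_next][b])      # greedy action, (29)
    h = theta[s][a] - reward - gamma * theta[s_next][a_star]   # sampled version of (28)
    # Tabular gradients are indicators, so D_k changes at most two entries.
    theta[s][a] -= eta * 2.0 * h
    theta[s_next][a_star] += eta * 2.0 * h * gamma
    return h

# Toy problem: one state, action 0 pays 0 and action 1 pays 1, both loop back.
# With gamma = 0.5 the optimal Q-values are Q(0, 1) = 2 and Q(0, 0) = 1.
theta = [[0.0, 0.0]]
for _ in range(5000):
    for a, reward in ((0, 0.0), (1, 1.0)):
        bellman_gradient_step(theta, 0, a, reward, 0, gamma=0.5, eta=0.1)
```

Note that, unlike Q-learning, the second line of the update also adjusts the entry of the successor state-action pair, since the gradient of the Bellman residual is taken with respect to both terms.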



Algorithm 3 BGBRL: Bellman gradient Bayesian RL 

  Input: prior $\xi_0$, parameters $\theta_0$, initial state $s_0$ 
  for $t = 0, \ldots$ do 
    $\mu_t \sim \xi_t$                                                    // Sample an MDP 
    $s'_t \sim P_{\mu_t}(s_{t+1} \mid s_t)$                               // Sample a next state 
    $\theta_{t+1} = \theta_t - \eta_t D_t$                                // Update parameters using (30) 
    $a_t = \arg\max_{a \in \mathcal{A}} q_{\theta_t}(s_t, a)$             // Act in the real MDP 
    $s_{t+1}, r_{t+1} \sim \mu$                                           // Observe next state, reward 
    $\xi_{t+1}(\cdot) = \xi_t(\cdot \mid s_{t+1}, r_{t+1}, s_t, a_t)$     // Update posterior 
  end for 



III. Experiments 

We present experiments illustrating the performance of 
U-MCBRL and BGBRL and compare them with other algorithms. 
In particular we also examine the lower-bound 
algorithm presented in [6], the well known UCRL [25] 
algorithm⁴ and $Q(\lambda)$, for completeness. 

A. Experiment design 

Since all algorithms have hyperparameters, we followed 
a principled experiment design methodology. Firstly, we 
selected a set of possible hyperparameter values for each 
algorithm. For each evaluation domain, we performed 10 
runs for each hyperparameter choice and chose the one 
with the highest total reward over these runs. We then 
measured the performance of the algorithms over 10"^ runs. 
This ensures an unbiased evaluation. 
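
The two-stage protocol above can be sketched as follows. This is only an illustration of the idea of separating tuning runs from evaluation runs: `run_episode` is a hypothetical stand-in for running one algorithm on one domain, and the candidate values, seeds and run counts are placeholders rather than the settings used in the experiments.

```python
import random

def select_and_evaluate(run_episode, candidates, n_pilot, n_eval):
    """Pick the candidate with the highest total pilot reward, then score it on fresh runs."""
    def pilot_total(param):
        return sum(run_episode(param, seed) for seed in range(n_pilot))
    best = max(candidates, key=pilot_total)
    # Evaluation seeds are disjoint from pilot seeds, so tuning does not leak into the score.
    scores = [run_episode(best, seed) for seed in range(n_pilot, n_pilot + n_eval)]
    return best, sum(scores) / len(scores)

def run_episode(step_size, seed):
    """Hypothetical learner: reward peaks at step_size = 0.3, plus bounded noise."""
    rng = random.Random(seed)
    return -(step_size - 0.3) ** 2 + rng.uniform(-0.01, 0.01)

best, mean_reward = select_and_evaluate(run_episode, [0.1, 0.3, 0.5],
                                        n_pilot=10, n_eval=100)
```

Because the reported mean is computed on runs that played no part in choosing the hyperparameter, it is not inflated by the selection step.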



Methods             Parameter       Function 
$Q(\lambda)$        $\epsilon_0$    exploration 
UCRL                $\delta$        confidence interval 
MCBRL, U-MCBRL      $N$             number of samples 
BGBRL, $Q(\lambda)$ $\eta_0$        step size 

TABLE I: Automatically tuned hyperparameters 



The hyperparameters that were automatically tuned 
for each method are listed in Table I. For $Q(\lambda)$, we fixed 
$\lambda = 0.9$ and used an $\epsilon$-greedy strategy with a decaying rate 
and tuned initial value $\epsilon_0$. For UCRL, we tuned the interval 
error probability $\delta$. Gradient algorithms require tuning 
the initial step-size parameter $\eta_0$. Monte-Carlo algorithms 
require tuning the number of samples $N$. UCRL, MCBRL 
and U-MCBRL all used the same policy-switching heuristic. 

B. Domains 

⁴ Although UCRL is defined for undiscounted problems, it is trivial to 
apply to discounted problems by replacing average value iteration 
with discounted value iteration. 

We employed standard discrete-state domains from the literature 
on exploration in reinforcement learning. Thus, 
Bayesian inference is closed-form, as we can use a Dirichlet-product 
prior for the transitions and a Normal-Gamma prior 
for the reward. Value function parametrisation is tabular, 
i.e. there is one parameter per state-action pair. These domains 
are the Chain problem [1], River-Swim [26] and Double-Loop [1]. 
In addition, we consider the mountain car domain 
of [27], using a uniform 5x5 grid as features. All domains 
employed a discount factor $\gamma = 0.99$. 
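
The closed-form transition update mentioned above amounts to counting: under a Dirichlet-product prior, each observed transition adds one pseudo-count. The fragment below is a minimal sketch of this, assuming a uniform prior and hypothetical function names; the Normal-Gamma reward update is omitted.

```python
def dirichlet_update(counts, s, a, s_next):
    """Conjugate update: add one pseudo-count for the observed transition (s, a) -> s_next."""
    counts[s][a][s_next] += 1.0

def posterior_mean(counts, s, a):
    """Expected transition probabilities under the Dirichlet posterior for (s, a)."""
    total = sum(counts[s][a])
    return [c / total for c in counts[s][a]]

# Uniform Dirichlet(1, 1, 1) prior for each of 3 states and 2 actions (an assumed prior).
counts = [[[1.0] * 3 for _ in range(2)] for _ in range(3)]
for s_next in (2, 2, 1):
    dirichlet_update(counts, 0, 1, s_next)
mean = posterior_mean(counts, 0, 1)
```

After these three observations the pseudo-counts for state 0 and action 1 are (1, 2, 3), so the posterior mean puts probability 1/2 on the most frequently observed successor.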



Domain             Method         95% lower    mean       95% upper    CPU (s) 
Chain              $Q(\lambda)$   1993.9       1999.7     2005.4       3 
                   UCRL           3543.5       3547.5     3551.3       1613 
                   MCBRL          3610.5       3616.1     3621.7       464 
                   U-MCBRL        3617.8       3623.4     3629.1       1560 
                   BGBRL          3593.6       3598.3     3602.7       48 
Double Loop        $Q(\lambda)$   2053.7       2058.1     2062.1       5 
                   UCRL           3841.0       3841.0     3841.0       369 
                   MCBRL          3949.5       3950.2     3951.0       2343 
                   U-MCBRL        3946.7       3947.5     3948.3       5135 
                   BGBRL          3925.3       3926.2     3927.0       96 
River Swim         $Q(\lambda)$   5.0          5.0        5.0          5 
                   UCRL           312.4        313.8      315.3        240 
                   MCBRL          624.0        625.4      626.8        1187 
                   U-MCBRL        626.3        627.6      629.0        2329 
                   BGBRL          600.3        601.7      603.2        69 
Mountain Car 5x5   $Q(\lambda)$   -9957.6      -9957.0    -9956.3      15 
                   UCRL           -9952.9      -9951.6    -9950.3      1908 
                   MCBRL          -9829.1      -9827.2    -9825.5      35733 
                   U-MCBRL        -9811.8      -9810.2    -9808.6      66252 
                   BGBRL          -9883.2      -9881.9    -9880.6      886 

TABLE II: Total reward and CPU time 



C. Results 

From the online performance results shown in Fig. 1, it 
is clear that apart from $Q(\lambda)$, all algorithms perform 
relatively similarly in the simpler environments. However, 
UCRL converges somewhat more slowly and is particularly 
unstable in the Mountain Car domain.⁵ 

A clearer view of the performance of each algorithm 
can be seen in Table II in terms of the average total 
reward obtained. It additionally shows the 95% lower and 
upper confidence bound calculated on the mean (shown 
in the middle column) via lO"* bootstrap samples. The 
best-performing methods in each environment (taking into 
account the bootstrap intervals) are shown in bold. One 
immediately notices that MCBRL and U-MCBRL are usually 
tied for best. This is perhaps not surprising, as they have 
the same structure: in fact, for $N = 1$, they are equivalent 
to Thompson sampling [1], as mentioned in [6]. However, 
MCBRL uses a lower bound on the value function, while U-MCBRL 
uses an upper bound, which makes it more optimistic.⁶ 
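
A bootstrap interval on the mean total reward, of the kind reported in Table II, can be computed along the following lines. This sketch uses the percentile variant of the bootstrap, which is an assumption since the text does not say which variant was used; the synthetic rewards, the number of resamples and the function name are likewise placeholders.

```python
import random

def bootstrap_mean_interval(samples, n_boot, level, rng):
    """Percentile bootstrap interval for the mean of `samples`."""
    n = len(samples)
    # Resample with replacement n_boot times and record each resample's mean.
    means = sorted(sum(rng.choices(samples, k=n)) / n for _ in range(n_boot))
    tail = (1.0 - level) / 2.0
    lower = means[int(tail * n_boot)]
    upper = means[min(n_boot - 1, int((1.0 - tail) * n_boot))]
    return lower, upper

rng = random.Random(0)
# Hypothetical total rewards from repeated runs of one algorithm.
rewards = [rng.gauss(100.0, 5.0) for _ in range(200)]
mean = sum(rewards) / len(rewards)
lower, upper = bootstrap_mean_interval(rewards, n_boot=1000, level=0.95, rng=rng)
```

Two methods whose intervals overlap, such as MCBRL and U-MCBRL on the Chain domain, cannot be separated on the basis of these runs alone.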

The most significant finding, however is that BGBRL 
is a relatively close second most of the time, performing 
better than all the remaining algorithms. This is despite its 
computational simplicity. 

⁵ Due to the discretisation, this domain is no longer fully observable. 

⁶ Although we did not explicitly consider Thompson sampling, we note 
that the hyperparameter $N = 1$ corresponding to Thompson sampling was 
never chosen by the automatic procedure as it always had worse performance 
than taking more samples. Nevertheless, its performance over the 10'^ runs 
was always significantly worse than those of MCBRL and U-MCBRL. 



IV. Conclusion 

This paper introduced a set of Monte-Carlo algorithms for 
Bayesian reinforcement learning. The first, U-MCBRL, is a 
modification of a lower-bound algorithm to an upper-bound 
setting, which has very good performance but has relatively 
high computational complexity. The second, DGBRL, is a 
type of gradient-based algorithm for approximating either 
the lower or the upper bound, but nevertheless does not 
necessarily alleviate the problem of complexity. Finally, BG- 
BRL defines a novel type of Bellman error minimisation, on 
the Bayes-expected value function. By performing gradient 
descent to reduce this error through sampling possible MDPs, 
we obtain an efficient and highly competitive algorithm. 

The algorithms were tested using an unbiased exper- 
imental methodology, whereby hyperparameters were au- 
tomatically selected from a small number of runs. This 
ensures that algorithmic brittleness is not an issue. In all 
of those experiments, U-MCBRL and its sibling, MCBRL, 
outperformed all alternatives. However, BGBRL was a close 
runner-up, even though it is computationally much simpler, 
as it does not require performing value iteration. 

A subject that this paper has not touched upon is the 
theoretical performance of U-MCBRL and BGBRL. For the 
first, the results for MCBRL [6] should be applicable with 
few modifications. The performance analysis of BGBRL-like 
algorithms, on the other hand, is a completely open question 
at the moment. 